RapidMiner Studio documentation, version 10.0

Read Office Files (Operator Toolbox)

Synopsis

This operator reads a Microsoft Word (.doc or .docx) or Microsoft PowerPoint (.ppt or .pptx) file and converts it to a document object.

Description

This operator reads a Microsoft Word (.doc or .docx) or Microsoft PowerPoint (.ppt or .pptx) file and converts it to a document object. Document objects are used primarily in the Text Processing extension for text mining projects. You can either connect a file object to the file input port (fil) or specify a file path in the parameters panel. File objects can be created by other operators that have file output ports, such as Open File or Loop Files.

By default, the operator uses the file extension to detect whether the file is doc, docx, ppt, or pptx. This detection can be overridden with the detect_file_type parameter: when it is set to false, the user must provide the file type through the file_extension parameter.
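The detection rule described above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the operator's actual implementation: the function name `resolve_file_type` is hypothetical, and the case-insensitive matching and tolerance for a leading dot are assumptions made for the sketch.

```python
from pathlib import Path

# File types named in this documentation.
SUPPORTED_TYPES = ("doc", "docx", "ppt", "pptx")


def resolve_file_type(path, detect_file_type=True, file_extension=None):
    """Return the file type to use for `path` (hypothetical helper).

    Mirrors the documented rule: use the file name's extension when
    detect_file_type is true, otherwise use the explicit file_extension.
    """
    if detect_file_type:
        candidate = Path(path).suffix
    else:
        if file_extension is None:
            raise ValueError("file_extension is required when detect_file_type is false")
        candidate = file_extension

    # Assumption: ignore case and accept the value with or without a leading dot.
    normalized = candidate.lower().lstrip(".")
    if normalized not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported file type: {candidate!r}")
    return normalized


if __name__ == "__main__":
    print(resolve_file_type("report.docx"))
    print(resolve_file_type("slides.bin", detect_file_type=False, file_extension="pptx"))
```

The second call shows the case the override is meant for: a file whose name does not carry a usable extension.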

Input

  • fil (File)

    The Microsoft Office file to read, supplied as a file object. Alternatively, a file path can be specified in the parameters panel.

Output

  • doc

    A document object containing the text of the original Word or PowerPoint file. Note that this excludes text from headers, footers, comments, or other text not contained in normal 'paragraphs' as defined by Microsoft.
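To see why headers and footers fall outside the extracted text, it helps to know that a .docx file is a ZIP archive in which the body paragraphs live in `word/document.xml`, while headers and footers are stored in separate parts. The following is a minimal sketch using only the Python standard library. It is not how the operator itself is implemented (the source does not say), it covers only .docx, and the helper names `body_paragraphs` and `make_sample_docx` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_paragraphs(docx_bytes):
    """Return the text of each paragraph in the main document part only."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        # Header and footer parts (e.g. word/header1.xml) are never opened.
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements.
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def make_sample_docx():
    """Build a stripped-down archive for the demo (not a complete Word file)."""
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Body text</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    header = (
        f'<w:hdr xmlns:w="{W_NS}">'
        "<w:p><w:r><w:t>Header text</w:t></w:r></w:p>"
        "</w:hdr>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document)
        archive.writestr("word/header1.xml", header)
    return buffer.getvalue()


if __name__ == "__main__":
    print(body_paragraphs(make_sample_docx()))
```

Because only the main document part is read, the header paragraph in the sample archive does not appear in the result.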

Parameters

  • file The path of the Microsoft Office file is specified here. Range:
  • detect_file_type If this parameter is set to true, the file name is used to determine whether the file is a .doc, .docx, .ppt, or .pptx file. Range: boolean
  • file_extension The file type (must be either '.doc', '.docx', 'ppt' or 'pptx'). This parameter is used only if detect_file_type is set to false. Range:

Tutorial Processes